Combined Iso-Seq and ChIP-Seq data for TSS determination

Zhigang Lu
18/06/2019

Processing Iso-Seq raw reads

PacBio Iso-Seq reads are available from male, female and somule samples. Reads from the different stages were merged.

Raw reads -> Filter reads -> Merge overlapped reads -> Intersect with gene models
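The "Merge overlapped reads" step can be pictured as a standard interval merge. The sketch below is only an illustration, not the pipeline actually used: the (start, end) tuple layout, the helper names and the choice of the sample standard deviation for the read starts are all assumptions.

```python
from statistics import stdev
from typing import List, NamedTuple, Tuple


class Merged(NamedTuple):
    start: int        # leftmost start among the merged reads
    end: int          # rightmost end among the merged reads
    n_reads: int      # number of reads merged into this interval
    start_sd: float   # sample SD of the read starts (0.0 for a single read)


def _summarise(group: List[Tuple[int, int]]) -> Merged:
    """Collapse a group of overlapping (start, end) reads into one record."""
    starts = [s for s, _ in group]
    sd = stdev(starts) if len(starts) > 1 else 0.0
    return Merged(min(starts), max(e for _, e in group), len(group), sd)


def merge_overlapping(reads: List[Tuple[int, int]]) -> List[Merged]:
    """Merge (start, end) intervals from one strand that overlap or touch."""
    merged: List[Merged] = []
    group: List[Tuple[int, int]] = []
    group_end = 0
    for start, end in sorted(reads):
        # A read starting past the current group's end opens a new group.
        if group and start > group_end:
            merged.append(_summarise(group))
            group = []
        group.append((start, end))
        group_end = end if len(group) == 1 else max(group_end, end)
    if group:
        merged.append(_summarise(group))
    return merged


# Hypothetical toy coordinates, not taken from the data set.
toy_reads = [(100, 500), (120, 480), (450, 900), (2000, 2500)]
result = merge_overlapping(toy_reads)
for m in result:
    print(m)
```

Reads on the plus and minus strands would be merged separately, which matches the plus-/minus- prefixes of the merged read names.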

Cleaning up raw reads

  • low mapping quality (q < 30)
  • too short (< 20% of gene length)
  • too long (> longest gene)
  • multiple mapping
  • manually inspected
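The automatic criteria in the list above could be expressed as a single predicate. This is a minimal sketch under stated assumptions: the `MappedRead` fields, the `keep_read` helper and the `LONGEST_GENE` value are hypothetical, "gene length" is taken to mean the length of the gene the read was assigned to, and the manual inspection step is not modelled.

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    name: str
    mapq: int           # mapping quality reported by the aligner
    length: int         # aligned length on the genome, in bp
    n_alignments: int   # number of genomic locations the read maps to
    gene_length: int    # length of the gene the read was assigned to, in bp


def keep_read(read: MappedRead, longest_gene: int) -> bool:
    """Return True if the read passes all four automatic filters."""
    if read.mapq < 30:                          # low mapping quality
        return False
    if read.length < 0.2 * read.gene_length:    # too short
        return False
    if read.length > longest_gene:              # too long
        return False
    if read.n_alignments > 1:                   # multiple mapping
        return False
    return True


LONGEST_GENE = 100_000  # hypothetical value, for illustration only

reads = [
    MappedRead("r1", 60, 5000, 1, 6000),     # passes every filter
    MappedRead("r2", 10, 5000, 1, 6000),     # low mapping quality
    MappedRead("r3", 60, 1000, 1, 6000),     # shorter than 20% of the gene
    MappedRead("r4", 60, 120000, 1, 6000),   # longer than the longest gene
    MappedRead("r5", 60, 5000, 3, 6000),     # maps to several locations
]
kept = [r.name for r in reads if keep_read(r, LONGEST_GENE)]
print(kept)
```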

This filtering might miss some potential merging information, but the raw reads can be referred back to if needed.

Checking reads mapped to 2 genes

To remove:
Smp_000020,Smp_000030 2 Smp_000020 24 Smp_000030 157
Smp_000110,Smp_000130 1 Smp_000110 10 Smp_000130 48
To keep:
Smp_000700,Smp_000710 10 Smp_000700 4 Smp_000710 78
Smp_004980,Smp_213790 7 Smp_004980 0 Smp_213790 0
Smp_012400,Smp_012410 4 Smp_012400 0 Smp_012410 22

In total, Iso-Seq reads mapped to 7121 genes.

gene        source    gstart    glength  gstrand  read               rstart    rlength  rstdev     diff
Smp_000020  AUGUSTUS  46644182  28795    +        plus-SM_V7_1#486   46644174  28796    2977.250   -8
Smp_000030  AUGUSTUS  46620712  15557    +        plus-SM_V7_1#485   46614735  16467    1511.700   -5977
Smp_000040  Apollo    46610461  19987    -        minus-SM_V7_1#450  46610433  19960    1725.950   28
Smp_000050  Apollo    46565658  95034    -        minus-SM_V7_1#449  46556314  85485    25912.000  9344
Smp_000070  Apollo    46461772  3444     +        plus-SM_V7_1#484   46461142  4098     297.719    -630

rstdev is the standard deviation of the start coordinates of the merged Iso-Seq reads.
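The diff column is not defined in the text, but the five rows above are consistent with a strand-aware offset between the read start and the gene start. The sketch below reproduces them. The function name `tss_diff` is hypothetical, and reading a negative value as "read starts upstream of the annotated TSS" assumes that gstart and rstart are 5' ends in transcript orientation.

```python
def tss_diff(gstart: int, rstart: int, strand: str) -> int:
    """Signed offset of the merged read start relative to the gene start."""
    if strand == "+":
        return rstart - gstart
    if strand == "-":
        return gstart - rstart
    raise ValueError(f"unknown strand: {strand!r}")


# (gene, gstart, gstrand, rstart, diff as listed in the table)
rows = [
    ("Smp_000020", 46644182, "+", 46644174, -8),
    ("Smp_000030", 46620712, "+", 46614735, -5977),
    ("Smp_000040", 46610461, "-", 46610433, 28),
    ("Smp_000050", 46565658, "-", 46556314, 9344),
    ("Smp_000070", 46461772, "+", 46461142, -630),
]

computed = [tss_diff(gstart, rstart, strand) for _, gstart, strand, rstart, _ in rows]
listed = [diff for *_, diff in rows]
print(computed == listed)
```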

summary(geneiso$diff)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-71482.0    -21.0      1.0    437.8     98.0  96381.0 

Genes with the largest distances

Distances of Iso-Seq reads to gene TSS

breaks: -1000 -500 -200 -100 0 100 200 500 1000
counts: 93 172 229 2544 1850 372 291 159
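Counts of this kind can be obtained by binning the diff values between consecutive breaks. The sketch below is an illustration only: it assumes right-closed bins (a, b], drops values outside the outermost breaks, and uses the five diff values from the example table rather than the full data set, so it does not reproduce the counts above.

```python
from bisect import bisect_left
from typing import List, Sequence

BREAKS = [-1000, -500, -200, -100, 0, 100, 200, 500, 1000]


def bin_counts(values: Sequence[float], breaks: Sequence[float]) -> List[int]:
    """Count values per bin (breaks[i], breaks[i + 1]]; out-of-range values are skipped."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        if breaks[0] < v <= breaks[-1]:
            # bisect_left finds the first break >= v, i.e. the bin's right edge.
            counts[bisect_left(breaks, v) - 1] += 1
    return counts


example_diffs = [-8, -5977, 28, 9344, -630]  # diff column of the example table
counts = bin_counts(example_diffs, BREAKS)
print(counts)
```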

H3K4me3 ChIP-Seq data

Data from Roquis et al. 2015, PLoS NTD: “The Epigenome of Schistosoma mansoni Provides Insight about How Cercariae Poise Transcription until Infection”

  • Two histone marks: H3K4me3 and H3K27me3
  • Five life stages: Mir, Sp1, Cerc, Somule, and Adult
  • Each with 2 biological replicates

H3K4me3 enriched around TSS (but low-resolution)

For H3K4me3, we found relatively sharp peaks with a mean peak maximum located 250 bp (or 1–2 nucleosomes) downstream of the transcription start site (TSS) of genes (Roquis et al 2015)

37.7% of H3K4me3 peaks fall within 500 bp downstream of the TSS, and 54% within 750 bp downstream.

Combined Iso-Seq and ChIP-Seq evidence

If a correct gene model is defined by:

  • Iso-Seq difference < 200 bp
  • ChIP-Seq peak maximum within +500 bp

Combined distance matrix for TSS determination

gene        isoseq-gene  chipseq-gene  read               chipseq-isoseq
Smp_000020  8            169           plus-SM_V7_1#486   177
Smp_000420  27           NA            plus-SM_V7_3#1745  NA
Smp_000690  NA           394           NoIsoseq           NA
Smp_000050  9344         9386          minus-SM_V7_1#449  42
Smp_000070  630          85            plus-SM_V7_1#484   235
Smp_000075  NA           NA            NoIsoseq           NA
Smp_000200  1587         1718          plus-SM_V7_3#1757  3305

NA: no data available. Gene might not be transcribed at that stage.

Combined distance matrix for TSS determination

gene        isoseq-gene  chipseq-gene  read               chipseq-isoseq  group
Smp_000020  8            169           plus-SM_V7_1#486   177             A
Smp_000420  27           NA            plus-SM_V7_3#1745  NA              B
Smp_000690  NA           394           NoIsoseq           NA              B
Smp_000050  9344         9386          minus-SM_V7_1#449  42              C
Smp_000070  630          85            plus-SM_V7_1#484   235             C
Smp_000075  NA           NA            NoIsoseq           NA              D
Smp_000200  1587         1718          plus-SM_V7_3#1757  3305            D

Grouping current gene models

  • A: correct (Iso-Seq distance < 100 bp, with ChIP-Seq support)
  • B: probably correct (supported by only one line of evidence)
  • C: needs curation (Iso-Seq distance >= 100 bp, with ChIP-Seq support)
  • D: not enough evidence (NA in at least one)
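One possible reading of these group definitions can be written as a small decision function. It is a sketch under explicit assumptions, not the rule set actually used: "ChIP-Seq support" is not defined in the slides, so it is interpreted here as a chipseq-isoseq distance of at most `SUPPORT_BP` (a hypothetical 500 bp threshold), and genes with both kinds of evidence but no such agreement are placed in D. Under those assumptions the function reproduces the groups of the seven example rows in the combined matrix.

```python
from typing import Optional

SUPPORT_BP = 500  # assumed threshold for "ChIP-Seq support"; not stated in the source


def assign_group(isoseq_gene: Optional[int],
                 chipseq_gene: Optional[int],
                 chipseq_isoseq: Optional[int]) -> str:
    """Assign a gene model to group A, B, C or D (None stands for NA)."""
    has_iso = isoseq_gene is not None
    has_chip = chipseq_gene is not None
    if has_iso != has_chip:
        return "B"                      # exactly one kind of evidence
    if not has_iso:
        return "D"                      # neither Iso-Seq nor ChIP-Seq data
    if chipseq_isoseq is None or chipseq_isoseq > SUPPORT_BP:
        return "D"                      # assumption: the two evidences disagree
    return "A" if isoseq_gene < 100 else "C"


# (gene, isoseq-gene, chipseq-gene, chipseq-isoseq, group listed in the table)
rows = [
    ("Smp_000020", 8, 169, 177, "A"),
    ("Smp_000420", 27, None, None, "B"),
    ("Smp_000690", None, 394, None, "B"),
    ("Smp_000050", 9344, 9386, 42, "C"),
    ("Smp_000070", 630, 85, 235, "C"),
    ("Smp_000075", None, None, None, "D"),
    ("Smp_000200", 1587, 1718, 3305, "D"),
]
groups = [assign_group(iso, chip, both) for _, iso, chip, both, _ in rows]
print(groups)
```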

Examples

Iso-Seq differences and groups

Smp_342880, Smp_144590

Sequence logo and motifs for genes with accurate TSS

Promoter region (-200 to +5) for genes with accurate TSS (Iso-Seq diff <= 5 bp and ChIP-Seq peak within 250 bp; 640 genes)
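Extracting such a window has to respect the gene strand. The sketch below shows one way to do it, under assumptions not stated in the source: the TSS is given as a 1-based coordinate and counted as offset 0, so a -200 to +5 window is 206 bp long, and minus-strand windows are reverse-complemented so that every promoter reads 5' to 3'. The function names and the toy sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter(chrom_seq: str, tss: int, strand: str,
             upstream: int = 200, downstream: int = 5) -> str:
    """Return the window from -upstream to +downstream around a 1-based TSS."""
    i = tss - 1  # 0-based index of the TSS base
    if strand == "+":
        lo, hi = i - upstream, i + downstream
    elif strand == "-":
        lo, hi = i - downstream, i + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    if lo < 0 or hi >= len(chrom_seq):
        raise ValueError("window runs off the end of the sequence")
    window = chrom_seq[lo:hi + 1]
    return window if strand == "+" else reverse_complement(window)


# Toy sequence: a single G at 1-based position 301, flanked by A's and C's.
toy = "A" * 300 + "G" + "C" * 300
plus_window = promoter(toy, 301, "+")
minus_window = promoter(toy, 301, "-")
print(len(plus_window), plus_window[200], minus_window[200])
```

Windows written out this way, one FASTA record per gene, are the kind of input that a motif finder such as MEME expects.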

Top motifs using MEME

(It would be better to use genes with similar functions; discovered motifs could also be used to infer binding sites for similar genes that lack sufficient evidence.)

Sequence logo and motifs for genes with inaccurate TSS

Promoter region for genes with wrong TSS (Iso-Seq diff > 500 bp, ChIP-Seq diff > 500 bp and Iso-Seq to ChIP-Seq distance < 250 bp; 464 genes)
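The two gene sets used for motif discovery can be selected with simple predicates over the combined distance matrix. This is a sketch only: the function names are hypothetical, the distances are assumed to be absolute values in bp (as they appear in the matrix), and None stands for NA.

```python
from typing import Optional


def is_accurate_tss(iso: Optional[int], chip: Optional[int]) -> bool:
    """Iso-Seq diff <= 5 bp and ChIP-Seq peak within 250 bp."""
    return iso is not None and chip is not None and iso <= 5 and chip <= 250


def is_wrong_tss(iso: Optional[int], chip: Optional[int],
                 chip_iso: Optional[int]) -> bool:
    """Both evidences far from the gene TSS (> 500 bp) yet close to each other (< 250 bp)."""
    if iso is None or chip is None or chip_iso is None:
        return False
    return iso > 500 and chip > 500 and chip_iso < 250


# Rows from the combined matrix: (gene, isoseq-gene, chipseq-gene, chipseq-isoseq)
rows = [
    ("Smp_000020", 8, 169, 177),
    ("Smp_000050", 9344, 9386, 42),
    ("Smp_000075", None, None, None),
    ("Smp_000200", 1587, 1718, 3305),
]
accurate = [g for g, iso, chip, _ in rows if is_accurate_tss(iso, chip)]
wrong = [g for g, iso, chip, both in rows if is_wrong_tss(iso, chip, both)]
print(accurate, wrong)
```

None of the example rows meets the stricter "accurate" criterion, since the smallest Iso-Seq difference among them is 8 bp.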

Top motifs:

(Is it possible that some of these TSSs are correct but were incorrectly grouped?)

Limitations

  • focused on gene TSS, not considering alternative splicing
  • merging overlapped Iso-Seq reads can be inaccurate
  • H3K4me3 signals for intronic ncRNAs can affect the accuracy